Updateable PAT-Tree Approach to Chinese Key Phrase Extraction using Mutual Information: A Linguistic Foundation for Knowledge Management
Abstract
There has been renewed research interest in the statistical approach to extracting key phrases from Chinese documents, because existing approaches do not allow online frequency updates after phrases have been extracted. This results in inaccurate, partial extraction. In this paper, we present an updateable PAT-tree approach. In our experiment, we compared our approach with that of Lee-Feng Chien; the comparison showed an improvement in recall from 0.19 to 0.43 and in precision from 0.52 to 0.70. This paper also reviews the requirements for a data structure that facilitates implementation of any statistical approach to key-phrase extraction, including the PAT-tree, the PAT-array, and the suffix array with semi-infinite strings.
A. Introduction: From Information Retrieval to Knowledge Management
In this era of the Internet and distributed multimedia computing, new and emerging classes of information technologies have swept into the lives of office workers and everyday people. As such technologies and applications become more overwhelming, pressing, and diverse, solutions to several well-known information technology problems have become even more urgent. Information overload, a result of the ease of information creation and representation via the Internet and the WWW, has become more evident (Blair & Maron, 1985) (Chen, Martinez, et al., 1998). Significant variations in database formats and structures, the richness of information media (text, audio, and video), and an abundance of translingual information content also require different information interoperability (Paepcke et al., 1996) (Lesk, 1997). Several new federal and business initiatives have emerged to attempt to transform our information-glut society into a knowledge-rich society.
In the U.S. National Science Foundation (NSF) Knowledge Networking (KN) initiative, scalable techniques to improve semantic bandwidth and knowledge bandwidth are considered among the priority research areas, as described in the KN report: “The Knowledge Networking initiative focuses on the integration of knowledge from different sources and domains across space and time... KN research aims to move beyond connectivity to achieve new levels of interactivity, increasing the semantic bandwidth, knowledge bandwidth, activity bandwidth, and cultural bandwidth among people, organizations, and communities” (Chen, 1998) (Chen & Ng, 1995). “Knowledge networking,” or the more general term “knowledge management” (KM), has attracted significant attention from academic researchers and even executives in Fortune 500 companies. O'Leary provides the following definition: “Enterprise knowledge management entails formally managing knowledge resources in order to facilitate access and reuse knowledge, typically by means of advanced technology. KM is formal in that knowledge is classified and categorized according to a pre-specified, but evolving, ontology into structured and semi-structured data and knowledge bases” (O’Leary, 1998). Knowledge management systems may employ various computational techniques, including linguistic analysis, data mining, machine learning, agents, information retrieval, and human-computer interaction. The information technology think tank Gartner Group defines KM as: “a discipline that promotes an integrated approach to identifying, capturing, retrieving, sharing and evaluating an enterprise’s information assets. These information assets may include databases, documents, policies and procedures as well as the uncaptured tacit expertise and experience resident in individual workers” (Gartner Group, 1998). Gartner Group predicts that KM may become the third wave of the Net, making significant impacts on business practices and the US economy in the next century.
Since 1997, 30% of Fortune 500 companies have either added a chief knowledge officer (CKO) position or converted the chief information officer (CIO) position into CKO. Many Fortune 500 and IT companies have considered knowledge sharing their most critical strategic area (Davenport, 1995) (Davenport & Prusak, 1998). Although knowledge management has been variously defined, it is evident that it exists at the enterprise level (Davenport & Prusak, 1998) and is quite distinct from mere information (Davenport & Prusak, 1998) (Nonaka, 1994) (Teece, 1998). Also apparent in this area are the challenges that knowledge management poses to an organization. In addition to being difficult to manage, knowledge traditionally has been stored on paper or in the minds of people (Davenport, 1995) (O'Leary, 1998). The KM problems facing many firms stem from barriers to access and utilization resulting from the content and format of information (Jones & Jordan, 1998) (Rouse, Thomas, & Boff, 1998). These problems make knowledge creation and utilization a complex and daunting process. Nevertheless, new knowledge management technologies have started to emerge in a number of different applications and organizations, such as virtual enterprising (Chen, Liao, & Prasad, 1998), joint ventures (Inkpen & Dinur, 1998), aerospace engineering (Jones & Jordan, 1998), and digital libraries (Chen, 1998) (Chen, Houston, et al., 1998). The just-released PITAC (President’s Information Technology Advisory Committee) report concluded that in the United States “the current Federal program is inadequate to start necessary new centers and research programs...
The end result is that critical problems are going unsolved and we are endangering the flow of ideas that have fueled the information economy.” Among the priorities for research, the PITAC report suggests that the federal program should “support fundamental research in capturing, managing, analyzing, and explaining information and in making it available for its myriad of users” (Schatz & Chen, 1999). In order to create a “knowledge map” from diverse information sources, Gartner Group has suggested a bottom-up approach that includes data extraction, linguistic analysis, dictionary/thesaurus creation, semantic networks, clustering/categorization, and concept yellow pages (Gartner Group, 1998). These layers of techniques can serve as the foundation for addressing multimedia and translingual interoperability as well. Other related techniques for multimedia processing, translingual indexing, and machine translation need to be developed to support additional functionality.